Minimax Optimal Estimation of Stability Under Distribution Shift
The performance of decision policies and prediction models often deteriorates
when applied to environments different from the ones seen during training. To
ensure reliable operation, we propose and analyze a measure of a system's
stability under distribution shift, defined as the smallest change in the
underlying environment that causes the system's performance to deteriorate
beyond a permissible threshold. In contrast to standard tail risk measures and
distributionally robust losses that require the specification of a plausible
magnitude of distribution shift, the stability measure is defined in terms of a
more intuitive quantity: the level of acceptable performance degradation. We
develop a minimax optimal estimator of stability and analyze its convergence
rate, which exhibits a fundamental phase shift behavior. Our characterization
of the minimax convergence rate shows that evaluating stability against large
performance degradation incurs a statistical cost. Empirically, we demonstrate
the practical utility of our stability framework by using it to compare system
designs on problems where robustness to distribution shift is critical.
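
The stability measure admits a simple plug-in illustration. The sketch below is only a rough illustration under assumptions not stated in the abstract: it takes KL divergence as the measure of distribution shift and estimates stability by reweighting the observed losses via exponential tilting; the paper's minimax-optimal estimator and its choice of divergence may differ.

    # Illustrative plug-in estimate of stability: the smallest KL divergence from
    # the empirical loss distribution to a reweighting whose mean loss reaches the
    # permissible degradation threshold. (Assumes KL as the shift measure; the
    # paper's estimator may differ.)
    import numpy as np
    from scipy.optimize import brentq

    def stability_kl(losses, threshold):
        losses = np.asarray(losses, dtype=float)
        if threshold <= losses.mean():
            return 0.0                      # performance already at the threshold
        if threshold >= losses.max():
            return float("inf")             # no reweighting of observed losses reaches it

        def tilted_gap(t):
            # Exponential tilting q_i proportional to exp(t * loss_i), numerically stabilized.
            w = np.exp(t * (losses - losses.max()))
            w = w / w.sum()
            return float(w @ losses) - threshold

        t_star = brentq(tilted_gap, 0.0, 1e4)   # tilt until the reweighted mean hits the threshold
        w = np.exp(t_star * (losses - losses.max()))
        w = w / w.sum()
        pos = w > 0
        # KL(q || uniform empirical distribution) for the tilted weights.
        return float(np.sum(w[pos] * np.log(w[pos] * losses.size)))

    rng = np.random.default_rng(0)
    test_losses = rng.uniform(0.0, 0.4, size=1000)   # toy test losses, mean around 0.2
    print(stability_kl(test_losses, threshold=0.35))

A larger degradation threshold requires a bigger shift before performance falls past it, so the returned divergence grows; this is the intuition behind evaluating stability at a chosen level of acceptable performance degradation.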
Diagnosing Model Performance Under Distribution Shift
Prediction models can perform poorly when deployed to target distributions
different from the training distribution. To understand these operational
failure modes, we develop a method, called DIstribution Shift DEcomposition
(DISDE), to attribute a drop in performance to different types of distribution
shifts. Our approach decomposes the performance drop into terms for 1) an
increase in harder but frequently seen examples from training, 2) changes in
the relationship between features and outcomes, and 3) poor performance on
examples infrequent or unseen during training. These terms are defined by
fixing a distribution on X while varying the conditional distribution of
Y | X between training and target, or by fixing the conditional distribution
of Y | X while varying the distribution on X. In order to do this, we
define a hypothetical distribution on X consisting of values common in both
training and target, over which it is easy to compare Y | X and thus
predictive performance. We estimate performance on this hypothetical
distribution via reweighting methods. Empirically, we show how our method can
1) inform potential modeling improvements across distribution shifts for
employment prediction on tabular census data, and 2) help to explain why
certain domain adaptation methods fail to improve model performance for
satellite image classification.
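
One ingredient of the approach, estimating predictive performance on X values common to both training and target via reweighting, can be sketched with a domain classifier. The helper below is a hypothetical stand-in: it weights training examples by the estimated odds of coming from the target domain; the paper's construction of the shared distribution and its weighting method may differ.

    # Illustrative reweighting in the spirit of DISDE (hypothetical helper; the
    # paper's exact weighting scheme may differ). A classifier separating training
    # from target X yields density-ratio-style weights, and the reweighted average
    # loss approximates performance on X values common to both distributions.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def reweighted_loss(X_train, loss_train, X_target):
        X = np.vstack([X_train, X_target])
        d = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_target))])
        clf = LogisticRegression(max_iter=1000).fit(X, d)
        p_target = clf.predict_proba(X_train)[:, 1]          # P(target | x) on training points
        w = p_target / np.clip(1.0 - p_target, 1e-6, None)   # odds of target membership
        return float(np.average(loss_train, weights=w))

    rng = np.random.default_rng(0)
    X_tr = rng.normal(0.0, 1.0, size=(2000, 3))
    X_tg = rng.normal(0.5, 1.0, size=(2000, 3))              # shifted covariates
    loss_tr = (X_tr[:, 0] > 1.0).astype(float)               # toy per-example loss
    print(loss_tr.mean(), reweighted_loss(X_tr, loss_tr, X_tg))

Comparing the unweighted and reweighted averages hints at how much of a performance gap comes from the covariate distribution shifting toward regions where the model does worse, as opposed to changes in the feature-outcome relationship.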
Modeling Interference Using Experiment Roll-out
Experiments on online marketplaces and social networks suffer from
interference, where the outcome of a unit is impacted by the treatment status
of other units. We propose a framework for modeling interference using a
ubiquitous deployment mechanism for experiments, staggered roll-out designs,
which slowly increase the fraction of units exposed to the treatment to
mitigate any unanticipated adverse side effects. Our main idea is to leverage
the temporal variations in treatment assignments introduced by roll-outs to
model the interference structure. We first present a set of model
identification conditions under which the estimation of common estimands is
possible and show how these conditions are aided by roll-out designs. Since
there are often multiple competing models of interference in practice, we then
develop a model selection method that evaluates models based on their ability
to explain outcome variation observed along the roll-out. Through simulations,
we show that our heuristic model selection method, Leave-One-Period-Out,
outperforms other baselines. We conclude with a set of considerations,
robustness checks, and potential limitations for practitioners wishing to use
our framework.
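
The Leave-One-Period-Out heuristic can be viewed as hold-one-out evaluation over roll-out periods. The sketch below assumes a hypothetical model interface (fit on a list of periods, predict per-period outcomes) and uses squared error as the score; the paper's exact procedure and error criterion may differ.

    # Illustrative Leave-One-Period-Out model selection. fit/predict and the
    # per-period data layout are hypothetical stand-ins for candidate interference
    # models; the score is mean squared error on the held-out period's outcomes.
    import numpy as np

    def leave_one_period_out(models, periods):
        """models: dict of name -> candidate interference model.
        periods: list of per-period observations, each a dict with an 'outcomes' array."""
        scores = {}
        for name, model in models.items():
            errors = []
            for t in range(len(periods)):
                held_out = periods[t]
                model.fit(periods[:t] + periods[t + 1:])    # fit on all other periods
                predicted = model.predict(held_out)
                errors.append(np.mean((predicted - held_out["outcomes"]) ** 2))
            scores[name] = float(np.mean(errors))
        best = min(scores, key=scores.get)
        return best, scores

The candidate that best predicts the outcome variation observed along the roll-out, period by period, is selected; models that mis-specify the interference structure tend to accumulate larger held-out errors.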